Section 1: Adding ancestral state reconstruction data to trees
Section 2: Comparing and visualizing different tree topologies
Section 3: Plotting trees with bar plots
#install.packages('ggtree')
#install.packages('ggplot2')
#install.packages('dplyr')
#install.packages('treeio')
#install.packages('phytools')
#install.packages('ape')
#install.packages('ggpubr')
#install.packages('ggnewscale')
#install.packages('ggstance')
library(ggtree)
## Registered S3 method overwritten by 'treeio':
## method from
## root.phylo ape
## ggtree v2.2.1 For help: https://yulab-smu.github.io/treedata-book/
##
## If you use ggtree in published research, please cite the most appropriate paper(s):
##
## [36m-[39m Guangchuang Yu. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics, 2020, 69:e96. doi:10.1002/cpbi.96
## [36m-[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi:10.1093/molbev/msy194
## [36m-[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(treeio)
## treeio v1.12.0 For help: https://yulab-smu.github.io/treedata-book/
##
## If you use treeio in published research, please cite:
##
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution 2020, 37(2):599-603. doi: 10.1093/molbev/msz240
library(phytools)
## Loading required package: ape
##
## Attaching package: 'ape'
## The following object is masked from 'package:treeio':
##
## drop.tip
## The following object is masked from 'package:ggtree':
##
## rotate
## Loading required package: maps
##
## Attaching package: 'phytools'
## The following object is masked from 'package:treeio':
##
## read.newick
library(ape)
library(ggpubr)
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:ape':
##
## rotate
## The following object is masked from 'package:ggtree':
##
## rotate
library(ggnewscale)
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
setwd('/Users/owlex/Dropbox/Documents/Northwestern/rcs_consult/r_phylogenetics_workshop/r_phylogenetics_worshop') # change this to your R Markdown's file path
Ancestral state reconstruction is a useful method that assess the likelihood of a specific trait occurring given a phylogeny.
The likelihood that a trait occurs can be visualized as a pie graph on each node in a tree
There are a few steps that need to be taken to do this.
The first is to read in metadata that has one column that matches the tip.labels found in a tree
# read in metadata
df_meta <- read.csv('enoyl_metadata.csv')
# read in tree
tree_raxml <- ape::read.tree('RAxML_bipartitions.raw_enoyl_seqs')
# root tree
tree_raxml <- ape::root(tree_raxml, outgroup='tr|E6KYQ2|E6KYQ2_9PAST')
# check that metadata and tree have the same tip labels
setdiff(df_meta$gene_id, tree_raxml$tip.label)
## character(0)
Now that both the metadata and the tree have been read in and verified to have the same tip.labels, we need to select which discrete variable we wish to examine.
We will examine the trait of any given gene conferring high triclosan tolerance (>=64 mg/ml). We need to make a new column, tolerance_binary that provides this as a binary variable
df_meta <- df_meta%>%
mutate(tolerance_binary = case_when(triclosan_tolerance >= 64 ~ 'high',
TRUE ~ 'low'))
Now we will use the ape::ace() package to conduct ancestral trait reconstruction.
This function takes as input a NAMED vector of trait values. Each trait value is associated with one tip.label.
Because we know that the tree tip.label matches the df_meta$gene_id, we will make names() of the traitvals equal to df_meta$gene_id
# vector of trait values
traitvals <- df_meta$tolerance_binary
traitvals
## [1] "high" "high" "high" "high" "high" "low" "low" "low" "low" "low"
## [11] "high" "high" "high" "high" "high" "high" "low" "low" "low" "low"
# associating each value with its corresponding gene_id/tip.label
names(traitvals) <- df_meta$gene_id
traitvals
## sp|Q9HZP8|FABV_PSEAE sp|Q9KRA3|FABV1_VIBCH
## "high" "high"
## sp|Q2P9J6|FABV_XANOM sp|Q73Q47|FABV_TREDE
## "high" "high"
## sp|Q62L02|FABV_BURMA sp|P0AEK4|FABI_ECOLI
## "high" "low"
## sp|P54616|FABI_BACSU sp|Q81GI3|FABI_BACCR
## "low" "low"
## sp|Q6GI75|FABI_STAAR sp|O24990|FABI_HELPY
## "low" "low"
## tr|A0A0M3G239|A0A0M3G239_HAEHA tr|E6KYQ2|E6KYQ2_9PAST
## "high" "high"
## tr|A0A4Y9GUQ1|A0A4Y9GUQ1_9PAST tr|A0A1R0ED53|A0A1R0ED53_HAEPA
## "high" "high"
## tr|B0BP34|B0BP34_ACTPJ sp|P24182|ACCC_ECOLI
## "high" "high"
## sp|P9WGT3|MABA_MYCTU sp|I6Y778|FABG4_MYCTU
## "low" "low"
## sp|P0AEK2|FABG_ECOLI sp|P71534|MABA_MYCS2
## "low" "low"
Now we are ready to conduct the ancestral state reconstruction.
# One idioscynracy of ancestral trait reconstruction is that all branch lengths must be greater than 0. If there are branches with length 0, make them very very small!
tree_raxml$edge.length[tree_raxml$edge.length == 0] <- 1e-7
# run reconstruction model
fitER <- ape::ace(traitvals, tree_raxml,
type="discrete",
method="ML",
CI=TRUE,
model="ARD",
scaled=TRUE,
kappa=1,
corStruct = NULL,
ip = 0.1,
use.expm = FALSE,
use.eigen = TRUE,
marginal = FALSE)
# the reconstruction model gives us a bunch of data stored in the dataframe `fitER`
fitER
##
## Ancestral Character Estimation
##
## Call: ape::ace(x = traitvals, phy = tree_raxml, type = "discrete",
## method = "ML", CI = TRUE, model = "ARD", scaled = TRUE, kappa = 1,
## corStruct = NULL, ip = 0.1, use.expm = FALSE, use.eigen = TRUE,
## marginal = FALSE)
##
## Log-likelihood: -7.001174
##
## Rate index matrix:
## high low
## high . 2
## low 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 0.2924 0.1792
## 2 0.3943 0.2934
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## high low
## 1.000000e+00 9.908523e-09
fitER$se
## [1] 0.1792304 0.2934136
fitER$loglik
## [1] -7.001174
fitER$rates
## [1] 0.2924253 0.3943180
fitER$lik.anc
## high low
## [1,] 0.9999999901 9.908523e-09
## [2,] 0.9997465209 2.534791e-04
## [3,] 0.9994393754 5.606246e-04
## [4,] 0.9892720889 1.072791e-02
## [5,] 0.4964713295 5.035287e-01
## [6,] 0.1701319540 8.298680e-01
## [7,] 0.0423220100 9.576780e-01
## [8,] 0.0425177703 9.574822e-01
## [9,] 0.0011610169 9.988390e-01
## [10,] 0.0004386237 9.995614e-01
## [11,] 0.1360223077 8.639777e-01
## [12,] 0.9999311172 6.888285e-05
## [13,] 0.9999998507 1.493180e-07
## [14,] 0.9999999771 2.289266e-08
## [15,] 0.0246159459 9.753841e-01
## [16,] 0.0057509640 9.942490e-01
## [17,] 0.0001647545 9.998352e-01
## [18,] 0.9998167046 1.832954e-04
## [19,] 0.9999999904 9.643157e-09
We will now make a new dataframe called df_ancstats from the ancestral state reconstruction values
# dataframe of values
df_ancstats <- as.data.frame(fitER$lik.anc)
# specify the node each value belongs to. We do this by getting the range of nodes in the tree and add the total number of nodes to each item the range
1:Nnode(tree_raxml)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Ntip(tree_raxml)
## [1] 20
df_ancstats$node <- 1:Nnode(tree_raxml)+Ntip(tree_raxml)
df_ancstats
## high low node
## 1 0.9999999901 9.908523e-09 21
## 2 0.9997465209 2.534791e-04 22
## 3 0.9994393754 5.606246e-04 23
## 4 0.9892720889 1.072791e-02 24
## 5 0.4964713295 5.035287e-01 25
## 6 0.1701319540 8.298680e-01 26
## 7 0.0423220100 9.576780e-01 27
## 8 0.0425177703 9.574822e-01 28
## 9 0.0011610169 9.988390e-01 29
## 10 0.0004386237 9.995614e-01 30
## 11 0.1360223077 8.639777e-01 31
## 12 0.9999311172 6.888285e-05 32
## 13 0.9999998507 1.493180e-07 33
## 14 0.9999999771 2.289266e-08 34
## 15 0.0246159459 9.753841e-01 35
## 16 0.0057509640 9.942490e-01 36
## 17 0.0001647545 9.998352e-01 37
## 18 0.9998167046 1.832954e-04 38
## 19 0.9999999904 9.643157e-09 39
Now we can plot the values using ggtree() using a function from ggtree called nodepie()
# make a pie chart for each individual probability
pies <- nodepie(df_ancstats,cols=c('high','low'),alpha=0.8)
pies
## $`21`
##
## $`22`
##
## $`23`
##
## $`24`
##
## $`25`
##
## $`26`
##
## $`27`
##
## $`28`
##
## $`29`
##
## $`30`
##
## $`31`
##
## $`32`
##
## $`33`
##
## $`34`
##
## $`35`
##
## $`36`
##
## $`37`
##
## $`38`
##
## $`39`
# make base tree
base_tree <- ggtree(tree_raxml)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# now make tree with pie charts in the node.s for the ancestral state reconstruction
tree_pie <- base_tree+
geom_inset(pies,width=.1, height=.1)+
geom_tiplab(size=2.6,align = TRUE,offset=.05)+
geom_treescale(x=5, color=NA)
## Loading required package: ggimage
##
## Attaching package: 'ggimage'
## The following object is masked from 'package:ggpubr':
##
## theme_transparent
## Warning: Duplicated aesthetics after name standardisation: size
tree_pie
Edit the above visualization so that the node pies are a different color, gene names are used, and a zoomed in visualization is provided.
Associate df_meta with base_tree using the %<+% operator. Set the geom_tiplab() aes(label=gene_name)
Use the lapply() function to changethe color of pies to 'orange' and 'blue'. Store in pies2
Add the new pies2 to geom_inset()
viewClade() to zoom in on node=25 and store in zoom_in_vis. After, use ggpubr::ggarrange() view tree_vis_ex2 and zoom_in_vis side by side.Trees with bootstraps can be visualized so that the other branching patterns are also visible.
Specifically, RAxML outputs a file with the prefix 'RAxML_boostrap.' (different from the one we’ve been using with the prefix 'RAxML_bipartitions.')
If we have run RAxML with 100 bootstraps, then this file will have the 100 different trees that were generated.
We can view all 100 trees using facet_wrap()
# read in tree
trees_raxml <- ape::read.tree('RAxML_bootstrap.raw_enoyl_seqs')
trees_raxml
## 100 phylogenetic trees
# there are multiple trees within trees_raxml
trees_raxml[[1]]
##
## Phylogenetic tree with 20 tips and 18 internal nodes.
##
## Tip labels:
## sp|Q62L02|FABV_BURMA, sp|Q2P9J6|FABV_XANOM, sp|Q9KRA3|FABV1_VIBCH, sp|P24182|ACCC_ECOLI, sp|P54616|FABI_BACSU, sp|Q6GI75|FABI_STAAR, ...
##
## Unrooted; no branch lengths.
trees_raxml[[2]]
##
## Phylogenetic tree with 20 tips and 18 internal nodes.
##
## Tip labels:
## sp|Q62L02|FABV_BURMA, sp|Q9KRA3|FABV1_VIBCH, sp|P24182|ACCC_ECOLI, sp|P54616|FABI_BACSU, sp|Q6GI75|FABI_STAAR, sp|Q81GI3|FABI_BACCR, ...
##
## Unrooted; no branch lengths.
# All of them can be plotted at once
ggtree(trees_raxml)+
facet_wrap(~.id, ncol=10, nrow=10)
If we want to view 10 random trees we can do use the base function sample()
random_trees <- sample(trees_raxml, size=10)
ggtree(random_trees)+
facet_wrap(~.id, ncol=5,nrow=2)
Another neat visualization is using the ggdenistree() function from the ggtree package. This overlays all the trees, with the most dense branches representing higher agreements between trees.
raxml_densitree <- ggdensitree(trees_raxml, alpha=0.4, color='lightblue')+
geom_tiplab(size=3)+
xlim(0,20)
raxml_densitree
The densitree visualiation is a good companion to a traditional bootstrapped tree when plotted together.
Note: we do not root the tree here because the 100 bootstraps are not rooted either.
tree_raxml1 <- ape::read.tree('RAxML_bipartitions.raw_enoyl_seqs')
tree_raxml_best <- ggtree(tree_raxml1)+
geom_nodelab(aes(label=label),nudge_x=-.15,nudge_y=.4,size=2,color='blue')+
geom_tiplab(size=3)+
xlim(0,10)
ggpubr::ggarrange(raxml_densitree, tree_raxml_best)
Some data is associated with phylogenies is better presented as a bar plot. We can plot two separate items, a tree and a bar plot, together in a way that the data lines up using ggtree’s facet_plot()
# read in metadata
df_meta <- read.csv('enoyl_metadata.csv')
# read in tree
tree_raxml <- ape::read.tree('RAxML_bipartitions.raw_enoyl_seqs')
# root tree
tree_raxml <- ape::root(tree_raxml, outgroup='tr|E6KYQ2|E6KYQ2_9PAST')
# make base tree
tree_raxml_visual <- ggtree(tree_raxml)
tree_raxml_visual <- tree_raxml_visual%<+%df_meta+
geom_nodelab(aes(label=node),nudge_x=-.15,nudge_y=.4,size=2,color='blue')+
geom_tiplab(aes(label=gene_name), size=3, align=TRUE)+
geom_treescale(x=3.5, color=NA)
## Warning: Duplicated aesthetics after name standardisation: size
tree_raxml_visual
One thing to note about facet_plot() is that it requires the first column of the dataframe it is associated with to have the same values as the tree’s tip.label AND be in the same order as they appear in the tree!
We will make a single-use dataframe, df_tree_order, to get the tip.label from the tree in the correct order and then merge() it with our metadata dataframe, df_meta. We then subset that dataframe to make a new dataframe called df_meta_tree
# make a dataframe of the tree tip labels and their order of appearance
df_tree_order <- data.frame('tip_label'=tree_raxml$tip.label,
'order'=seq(1:length(tree_raxml$tip.label)))
# merge with the metadata table
df_tree_order <- merge(df_tree_order,df_meta, by.x='tip_label', by.y='gene_id')
# make a dataframe consisting of the tip label and valariable of interest
df_meta_tree <- data.frame('tip_label'=df_tree_order$tip_label,
'tcs'=df_tree_order$triclosan_tolerance)
# plot
facet_plot(tree_raxml_visual, data=df_meta_tree,
geom=geom_barh,
mapping = aes(x = tcs),
stat='identity',
panel='triclosan tolerance'
)
We can make some small adustments using ggplot annotation layers so that the visual is more informative.
facet_plot(tree_raxml_visual, data=df_meta_tree,
geom=geom_barh,
mapping = aes(x = tcs),
stat='identity',
panel='triclosan tolerance'
)+
theme_classic2()+
theme(strip.background = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
There are many more ways to work with tree and bar plot/boxplot data!
Thank you for coming to Day 3!